QA dataset
- Europe > Belgium > Brussels-Capital Region > Brussels (0.14)
- Europe > Romania (0.04)
- Europe > United Kingdom > England (0.04)
- (19 more...)
- Health & Medicine > Therapeutic Area > Endocrinology (1.00)
- Education (1.00)
- Banking & Finance (0.92)
- (3 more...)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > British Indian Ocean Territory > Diego Garcia (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Israel (0.04)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Health Care Technology (0.69)
- Health & Medicine > Nuclear Medicine (0.68)
Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts
Gupta, Raavi, Panicker, Pranav Hari, Bhatia, Sumit, Ramakrishnan, Ganesh
Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate, generating text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via APIs provided by LLM vendors, where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings, where model access is restricted or constrained by resources, typically require multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not rely on any external knowledge base and builds on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets covering both factual and open-ended text generation shows that CONFACTCHECK detects hallucinated facts efficiently, using fewer resources and achieving higher accuracy than existing baselines operating under similar conditions. Our code is available here.
- Europe (1.00)
- North America > United States > California (0.28)
- Research Report (1.00)
- Overview > Fact Book (0.43)
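The consistency intuition behind CONFACTCHECK can be sketched as follows: the same factual probe is answered several times (within one LLM or across LLMs), and low agreement among the answers flags a likely hallucination. The function names, the majority-vote scoring, and the 0.7 threshold below are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def consistency_score(answers):
    """Fraction of probe answers that agree with the majority answer."""
    counts = Counter(a.strip().lower() for a in answers)
    majority = counts.most_common(1)[0][1]
    return majority / len(answers)

def flag_hallucination(answers, threshold=0.7):
    """Flag a fact as likely hallucinated when repeated probes disagree."""
    return consistency_score(answers) < threshold
```

In practice each element of `answers` would come from a separate LLM call probing the same fact extracted from the generated text; here the calls are assumed to have already happened.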
A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge
Song, Jongyoon, Yu, Sangwon, Yoon, Sungroh
Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning that the prompt format influences their responses more than the semantics of the negative response does. For a fine-grained study of negative bias, we introduce a pipeline for constructing an evaluation set that systematically categorizes the dataset into three subsets based on the model's parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an "I don't know" option generally reduce negative bias, whereas chain-of-thought prompting tends to amplify it. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.
Recent advances in the capabilities and emergent abilities of large language models (LLMs) have led to rapid improvements in performance across a wide range of natural language processing (NLP) tasks [1]-[5]. Leveraging their ability to follow instructions, LLMs are able to perform complex, previously unseen tasks, enabling human-like interactions [6]-[9].
One critical issue is the hallucination problem, where the model generates content containing misleading information that does not correspond to the given context or to real-world knowledge [11].
J. Song was with the Department of Electrical and Computer Engineering, Seoul National University, South Korea (coms1580@gmail.com).
- North America > United States > Minnesota (0.28)
- Asia > South Korea > Seoul > Seoul (0.24)
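The bias measurement described above can be sketched as a simple "no"-rate statistic over a balanced yes-no set, computed separately for the correct / incorrect / insufficient-knowledge subsets. The crude string matching and the subset grouping here are illustrative assumptions, not the paper's actual pipeline.

```python
def negative_rate(responses):
    """Share of responses answering 'no'; ~0.5 is expected on a balanced yes-no set.

    Matching on the 'no' prefix is deliberately crude; a real pipeline would
    normalize answers more carefully.
    """
    negatives = sum(1 for r in responses if r.strip().lower().startswith("no"))
    return negatives / len(responses)

def bias_by_subset(labeled):
    """Negative rate per knowledge subset ('correct', 'incorrect', 'insufficient')."""
    return {subset: negative_rate(responses) for subset, responses in labeled.items()}
```

A negative rate well above 0.5 on the insufficient-knowledge subset would reproduce the shortcut behavior the paper reports.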
DeepSpecs: Expert-Level Questions Answering in 5G
Manvattira, Aman Ganapathy, Xu, Yifei, Dang, Ziyue, Lu, Songwu
5G technology enables mobile Internet access for billions of users. Answering expert-level questions about 5G specifications requires navigating thousands of pages of cross-referenced standards that evolve across releases. Existing retrieval-augmented generation (RAG) frameworks, including telecom-specific approaches, rely on semantic similarity and cannot reliably resolve cross-references or reason about specification evolution. We present DeepSpecs, a RAG system enhanced by structural and temporal reasoning via three metadata-rich databases: SpecDB (clause-aligned specification text), ChangeDB (line-level version diffs), and TDocDB (standardization meeting documents). DeepSpecs explicitly resolves cross-references by recursively retrieving referenced clauses through metadata lookup, and traces specification evolution by mining changes and linking them to Change Requests that document design rationale. We curate two 5G QA datasets: 573 expert-annotated real-world questions from practitioner forums and educational resources, and 350 evolution-focused questions derived from approved Change Requests. Across multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art telecom RAG systems; ablations confirm that explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality, underscoring the value of modeling the structural and temporal properties of 5G standards.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
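The explicit cross-reference resolution described above can be sketched as a bounded recursive lookup over a clause-keyed store. The dict-based SpecDB schema, the depth bound, and the function name are assumptions for illustration, not DeepSpecs' actual interfaces.

```python
def resolve_clause(clause_id, spec_db, cross_refs, depth=2, seen=None):
    """Collect a clause plus the clauses it cross-references, recursively.

    spec_db maps clause id -> clause text; cross_refs maps clause id -> referenced ids.
    `depth` bounds the recursion and `seen` guards against reference cycles,
    which are common in cross-referenced standards.
    """
    if seen is None:
        seen = set()
    if clause_id in seen or depth < 0 or clause_id not in spec_db:
        return []
    seen.add(clause_id)
    retrieved = [(clause_id, spec_db[clause_id])]
    for ref in cross_refs.get(clause_id, []):
        retrieved.extend(resolve_clause(ref, spec_db, cross_refs, depth - 1, seen))
    return retrieved
```

The metadata lookup replaces similarity search: a referenced clause is fetched because it is named, not because its text happens to resemble the query.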
Benchmarking GPT-5 for biomedical natural language processing
Hou, Yu, Zhan, Zaifu, Zeng, Min, Wu, Yifan, Zhou, Shuang, Zhang, Rui
Biomedical literature and clinical narratives pose multifaceted challenges for natural language understanding, from precise entity extraction and document synthesis to multi-step diagnostic reasoning. This study extends a unified benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across five core biomedical NLP tasks (named entity recognition, relation extraction, multi-label document classification, summarization, and simplification) and nine expanded biomedical QA datasets covering factual knowledge, clinical reasoning, and multimodal visual understanding. Using standardized prompts, fixed decoding parameters, and consistent inference pipelines, we assessed model performance, latency, and token-normalized cost under official pricing. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets such as MedXpertQA and DiagnosisArena and stable improvements in multimodal QA. In core tasks, GPT-5 achieved better chemical NER and ChemProt scores but remained below domain-tuned baselines for disease NER and summarization. Despite producing longer outputs, GPT-5 showed comparable latency and 30 to 50 percent lower effective cost per correct prediction. Fine-grained analyses revealed improvements in diagnosis, treatment, and reasoning subtypes, whereas boundary-sensitive extraction and evidence-dense summarization remain challenging. Overall, GPT-5 approaches deployment-ready performance for biomedical QA while offering a favorable balance of accuracy, interpretability, and economic efficiency. The results support a tiered prompting strategy: direct prompting for large-scale or cost-sensitive applications, and chain-of-thought scaffolds for analytically complex or high-stakes scenarios, highlighting the continued need for hybrid solutions where precision and factual fidelity are critical.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.46)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.46)
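The "cost per correct prediction" comparison above reduces to a small calculation. The per-million-token pricing convention below is an assumption about how the official pricing is applied; the function is a sketch, not the study's evaluation code.

```python
def cost_per_correct(n_correct, prompt_tokens, completion_tokens,
                     price_in_per_m, price_out_per_m):
    """Effective API cost per correct prediction.

    Prices are quoted per 1M tokens (a common vendor convention, assumed here).
    A model with longer outputs can still win if its accuracy gain outpaces
    its extra token spend.
    """
    total_cost = (prompt_tokens / 1e6) * price_in_per_m \
               + (completion_tokens / 1e6) * price_out_per_m
    return total_cost / n_correct
```
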
Exploring Models and Data for Image Question Answering
Mengye Ren, Ryan Kiros, Richard Zemel
This work aims to address the problem of image-based question-answering (QA) with new models and datasets. In our work, we propose to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images. Our model performs 1.8 times better than the only published results on an existing image QA dataset. We also present a question generation algorithm that converts image descriptions, which are widely available, into QA form. We used this algorithm to produce an order-of-magnitude larger dataset, with more evenly distributed answers. A suite of baseline results on this new dataset is also presented.
- North America > Canada > Ontario > Toronto (0.15)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
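The caption-to-QA conversion described above can be illustrated with a deliberately naive object-question rewrite. The paper's actual algorithm is more sophisticated (it also generates number, color, and location questions), so treat this purely as a sketch of the idea.

```python
def object_question(caption):
    """Turn 'a/an/the <noun> <phrase>' into ('what is <phrase>?', <noun>).

    A toy rule: the word after the leading article is assumed to be the object
    noun, and the remainder of the caption becomes the question body.
    """
    words = caption.lower().rstrip(".").split()
    if len(words) < 3 or words[0] not in {"a", "an", "the"}:
        return None  # caption shape not handled by this toy rule
    noun, rest = words[1], " ".join(words[2:])
    return f"what is {rest}?", noun
```

Applied over a large caption corpus, even a rule this simple yields QA pairs at scale, which is the leverage the paper's generation algorithm exploits.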